# HPC Solutions – Engineering Overview

HPC and AI Innovations Lab www.hpcatdell.com

May 2021



# Dell Technologies HPC & AI Innovation Lab



**Develop** Best Practices & Solutions

Industry-focused Research with Customers and Partners

**Contribute** to the Community

## **Customers Have LOTS of Questions**













# World-class infrastructure in the Innovation Lab

13K ft.<sup>2</sup> lab, 1,300+ servers, ~10PB storage dedicated to HPC and AI in collaboration with the community

#### **Zenith**

- TOP500-class system
- Ranked #383, #292, #265 and #396 on past TOP500 lists
- 420 Xeon servers, HDR100 InfiniBand
- ~900 TF combined performance!
- BeeGFS, Isilon H600 and Isilon F800 storage
- Liquid cooled and air cooled

#### Rattler

- Mellanox, NVIDIA and Bright Computing
- ~20 GPU cluster, HDR100 InfiniBand
- BeeGFS storage

#### Minerva

- 128 AMD EPYC Rome dual socket systems, HDR InfiniBand
- BeeGFS storage

#### Other systems

- Smaller test clusters, storage solutions, etc.



# Intel Ice Lake Architecture



# Ice Lake-SP IO and Memory Hierarchy

#### **Integrating PCIe Gen4 controllers**

- New IO Virtualization design, enables up to 3x BW scaling on large payloads (2x frequency, larger TLB, supports 2M/1G pages for in translation requests)
- New P2P credit fabric implementation to reach top P2P BW targets

#### 3 independently clocked UPI links

#### 4 Memory Controllers with enhanced per channel schedulers

New memory controller design w/ optimizations

#### Intel® Total Memory Encryption (TME)

DRAM encrypted using 128-bit AES-XTS

#### **Intel Optane Persistent Memory 200 Series (Barlow Pass)**

Higher speed and better power profile

#### Ice Lake SP (28 core example)



# Intel® Speed Select Technology (Intel® SST) Features

Offers a suite of capabilities that let users reconfigure the processor dynamically, at runtime, to match the workload and maximize performance









Internal Use - Confidential 7 of © Copyright 2021 Dell Inc.

## The New Dell EMC PowerEdge Server Portfolio



ITALICS: ALL NEW INTEL ICE LAKE 2-SOCKET SERVERS

#### YOUR INNOVATION ENGINE

Rectangle shape: Intel Ice Lake servers as part of HPC & AI Solutions - Bravo

Technology and solutions that help you innovate, adapt, and grow


# Interconnects

# Snoop Hold off – BIOS Option

#### Roll256Cycles

| Message Size (bytes) | WindowSize=64 Bandwidth (GB/s) | WindowSize=64 Messages/s | WindowSize=512 Bandwidth (GB/s) | WindowSize=512 Messages/s |
|----------------------|--------------------------------|--------------------------|---------------------------------|---------------------------|
| 1                    | 0.0                            | 14 M                     | 0.1                             | 105 M                     |
| 2                    | 0.1                            | 40 M                     | 0.3                             | 130 M                     |
| 4                    | 0.5                            | 118 M                    | 0.8                             | 188 M                     |
| 8                    | 0.9                            | 116 M                    | 1.3                             | 162 M                     |
| 16                   | 1.7                            | 107 M                    | 3.1                             | 191 M                     |
| 32                   | 3.9                            | 121 M                    | 4.8                             | 149 M                     |
| 64                   | 0.7                            | 11 M                     | 2.6                             | 41 M                      |
| 128                  | 1.1                            | 9 M                      | 1.4                             | 11 M                      |

#### Roll2KCycles

| Message Size (bytes) | WindowSize=64 Bandwidth (GB/s) | WindowSize=64 Messages/s | WindowSize=512 Bandwidth (GB/s) | WindowSize=512 Messages/s |
|---------|---------------|------------|----------------|------------|
| 1       | 0.2           | 156 M      | 0.2            | 204 M      |
| 2       | 0.3           | 160 M      | 0.4            | 195 M      |
| 4       | 0.6           | 160 M      | 0.8            | 191 M      |
| 8       | 1.3           | 158 M      | 1.5            | 188 M      |
| 16      | 3.0           | 189 M      | 3.1            | 191 M      |
| 32      | 4.6           | 144 M      | 4.8            | 149 M      |
| 64      | 4.5           | 71 M       | 4.7            | 73 M       |
| 128     | 7.3           | 57 M       | 8.2            | 64 M       |

- OSU Message rate test with all cores.
- The Snoop Hold Off BIOS option selects the number of cycles for which PCI I/O can withhold snoop requests from the CPU.
- Additional SnoopHldOff options are being added in the next block BIOS releases.
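
The bandwidth columns in the two tables follow from message size multiplied by message rate. A minimal Python sketch, assuming message sizes are in bytes and 1 GB = 10^9 bytes (both are assumptions, and the function name is illustrative), recomputes a few WindowSize=64 rows and the message-rate ratio between the two settings:

```python
def bandwidth_gbs(message_size_bytes, messages_per_second):
    """Bandwidth in GB/s implied by a message size and a message rate."""
    return message_size_bytes * messages_per_second / 1e9

# Message size (bytes) -> messages/s at WindowSize=64, copied from the tables
roll256 = {1: 14e6, 64: 11e6, 128: 9e6}
roll2k = {1: 156e6, 64: 71e6, 128: 57e6}

for size in sorted(roll2k):
    bw = bandwidth_gbs(size, roll2k[size])
    gain = roll2k[size] / roll256[size]
    print(f"{size:>4} B: {bw:.1f} GB/s with Roll2KCycles, "
          f"{gain:.1f}x the Roll256Cycles message rate")
```

The same check can be applied to any other row to spot transcription errors in the tables.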

## Mellanox HDR - Gotchas

- SNAPI only: set virt\_enable to 2 in opensm.conf
- To achieve full bandwidth on SNAPI cards in the C6520 server:
  - ADVANCED\_PCI\_SETTINGS should be set to TRUE
  - MAX\_ACC\_OUT\_READ should be set to 16 to achieve full bandwidth from both the local and the remote socket
- Ongoing discussion with Mellanox on HDR200 bidirectional bandwidth (BiBW)

| NUMA node | Bandwidth at 4 MB message size |
|-----------|--------------------------------|
| 0         | 39.2 GB/s                      |
| 1         | 37.8 GB/s                      |
| 2         | 49.2 GB/s                      |
| 3         | 41.3 GB/s                      |

```
[root@gpu105 ~]# mst status -v
MST modules:
------
MST PCI module is not loaded
MST PCI configuration module loaded
PCI devices:
-----
DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX6(rev:0) /dev/mst/mt4123_pciconf0 98:00.0 mlx5_0 net-ib0
```

# Application performance and BIOS tuning for HPC

# HPL (optimization and performance)

The best application performance can be achieved with the HPL setup bundled with Intel Parallel Studio.

For the open-source version of HPL:

- Intel MKL is recommended.
- With the Intel compiler, **-qopt-zmm-usage=high -xICELAKE-SERVER** are the appropriate architecture flags; they enable AVX-512 SIMD instruction support.

1 process per NUMA node is the recommended launch configuration

| CPU  | Cores | Frequency     | Performance (GFLOPS) | Efficiency |
|------|-------|---------------|----------------------|------------|
| 8380 | 40    | 2.3 - 3.4 GHz | 4586.80              | 0.78       |
| 6338 | 32    | 2.0 - 3.2 GHz | 3304.64              | 0.81       |
| 8280 | 28    | 2.7 - 4.0 GHz | 3308.06              | 0.68       |
| 6252 | 24    | 2.1 - 3.7 GHz | 2407.13              | 0.68       |
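
HPL efficiency is conventionally the measured result divided by theoretical peak. A minimal sketch of that calculation, assuming two sockets per node, the base frequency from the table, and 32 double-precision FLOPs per cycle per core (two AVX-512 FMA units). These assumptions and the function names are illustrative and are not stated on the slide:

```python
def peak_gflops(cores_per_socket, base_ghz, sockets=2, flops_per_cycle=32):
    """Theoretical double-precision peak in GFLOPS.

    flops_per_cycle=32 assumes two AVX-512 FMA units per core:
    2 units x 8 doubles x 2 operations (multiply and add).
    """
    return cores_per_socket * sockets * base_ghz * flops_per_cycle


def hpl_efficiency(measured_gflops, cores_per_socket, base_ghz):
    """Measured HPL result divided by theoretical peak."""
    return measured_gflops / peak_gflops(cores_per_socket, base_ghz)


# 8380 row: 40 cores per socket, 2.3 GHz base, 4586.80 GFLOPS measured
print(round(hpl_efficiency(4586.80, 40, 2.3), 2))
# 6338 row: 32 cores per socket, 2.0 GHz base, 3304.64 GFLOPS measured
print(round(hpl_efficiency(3304.64, 32, 2.0), 2))
```

Under these assumptions the 8380 and 6338 rows evaluate to 0.78 and 0.81, in line with the table.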




# STREAM Dual Socket (optimization and performance)

Intel compilers are recommended to get expected performance.

Streaming/non-temporal store support is required for optimal performance numbers

Recommended compiler flags (Intel compiler):

-xICELAKE-SERVER -O3 -ffreestanding -qopenmp -qopenmp-link=static -mcmodel=medium -shared-intel -restrict -qopt-streaming-stores always -DSTREAM\_ARRAY\_SIZE=160000000 -DNTIMES=100 -DOFFSET=0 -DVERBOSE -qopt-zmm-usage=high

At run time, set the KMP\_AFFINITY environment variable to "granularity=fine,scatter", and set the system file /sys/kernel/mm/transparent\_hugepage/enabled to never.
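
Before a STREAM run it can help to confirm the transparent hugepage mode. The kernel marks the active mode in that sysfs file with square brackets (for example `always madvise [never]`). A minimal Python sketch that parses it; the helper name is illustrative:

```python
import re
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs file contents."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


if __name__ == "__main__":
    if THP_PATH.exists():
        mode = active_thp_mode(THP_PATH.read_text())
        print(f"transparent_hugepage: {mode}")
        if mode != "never":
            print("the STREAM guidance here expects 'never'")
    else:
        print(f"{THP_PATH} not found on this system")
```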

STREAM TRIAD results were generated by subscribing all available cores on system.

| CPU  | Cores | Frequency     | Performance (GB/s) | Efficiency |
|------|-------|---------------|--------------------|------------|
| 8380 | 40    | 2.3 - 3.4 GHz | 326.8              | 0.80       |
| 6338 | 32    | 2.0 - 3.2 GHz | 328.9              | 0.80       |
| 8280 | 28    | 2.7 - 4.0 GHz | 230.3              | 0.82       |
| 6252 | 24    | 2.1 - 3.7 GHz | 230.5              | 0.82       |
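
STREAM efficiency compares measured TRIAD bandwidth with theoretical peak memory bandwidth. A minimal sketch, assuming two sockets, 8 bytes per transfer per channel, 8 channels of DDR4-3200 per socket for the Ice Lake parts, and 6 channels of DDR4-2933 per socket for the Cascade Lake parts. The Cascade Lake memory configuration is an assumption and is not stated in the table:

```python
def peak_memory_bandwidth_gbs(channels, mega_transfers_per_s,
                              sockets=2, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth of the node in GB/s."""
    return channels * mega_transfers_per_s * bytes_per_transfer * sockets / 1000.0


def stream_efficiency(measured_gbs, channels, mega_transfers_per_s):
    """Measured TRIAD bandwidth divided by theoretical peak."""
    return measured_gbs / peak_memory_bandwidth_gbs(channels, mega_transfers_per_s)


# Ice Lake 8380 row: 326.8 GB/s measured, 8 channels at 3200 MT/s (assumed)
print(round(stream_efficiency(326.8, 8, 3200), 2))
# Cascade Lake 8280 row: 230.3 GB/s measured, 6 channels at 2933 MT/s (assumed)
print(round(stream_efficiency(230.3, 6, 2933), 2))
```

Under these assumptions the two rows evaluate to 0.80 and 0.82, in line with the table.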




## Power Utilization - Ice Lake vs Cascade Lake



| CPU  | Cores | Frequency     | TDP   |
|------|-------|---------------|-------|
| 8380 | 40    | 2.3 - 3.4 GHz | 270 W |
| 6338 | 32    | 2.0 - 3.2 GHz | 205 W |
| 8280 | 28    | 2.7 - 4.0 GHz | 205 W |
| 6252 | 24    | 2.1 - 3.7 GHz | 150 W |

# Open Issues w/ RHEL 8.3

# Issue Sighting – ICX-SP C-States with RHEL 8.3

#### **Issue Description**

- The base RHEL 8.3 kernel 4.18.0-240.el8 does not include C-state definitions for Ice Lake in the intel\_idle driver.
- This results in C-state behavior for Ice Lake that is not consistent with previous generation Intel processors and not consistent with patched kernels.

#### Resolution

- The intel\_idle driver was patched in the 4.18.0-240.11.1.el8\_3 update kernel to include Ice Lake C-state definitions.
- Recommendation: update to the 4.18.0-240.11.1.el8\_3 kernel or later.
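
A quick way to check whether a node already runs a kernel at or beyond the fixed release is to compare version fields numerically. A minimal sketch, assuming RHEL-style release strings such as those reported by `uname -r`; the function names are illustrative and the comparison is only meant for RHEL 8 kernels:

```python
import platform
import re

# 4.18.0-240.11.1.el8_3, the first kernel named above as carrying the fix
FIXED = (4, 18, 0, 240, 11, 1)


def rhel_kernel_tuple(release):
    """Turn '4.18.0-240.11.1.el8_3.x86_64' into (4, 18, 0, 240, 11, 1)."""
    match = re.match(r"^(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)\.el", release)
    if match is None:
        raise ValueError(f"not a RHEL-style kernel release: {release!r}")
    version, build = match.groups()
    return tuple(int(part) for part in (version + "." + build).split("."))


def has_icelake_cstate_fix(release):
    """True when the release is at or beyond the fixed kernel."""
    return rhel_kernel_tuple(release) >= FIXED


if __name__ == "__main__":
    release = platform.release()
    try:
        print(release, has_icelake_cstate_fix(release))
    except ValueError as err:
        print(err)
```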

## Issue Identification

- Base kernel uses ACPI C-states when C-states are enabled in BIOS.
- Patched kernel uses intel\_idle defined C-states, which are always enabled by default.

#### Base Kernel 4.18.0-240

```
$ cpupower idle-info
CPUidle driver: intel_idle

Number of idle states: 3
Available idle states: POLL C1_ACPI C2_ACPI
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
C1_ACPI:
Flags/Description: ACPI FFH INTEL MWAIT 0x0
Latency: 1
C2_ACPI:
Flags/Description: ACPI FFH INTEL MWAIT 0x20
Latency: 41
```

#### Patched Kernel 4.18.0-240.22.1

```
$ cpupower idle-info
CPUidle driver: intel_idle
Number of idle states: 4
Available idle states: POLL C1 C1E C6
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
C1:
Flags/Description: MWAIT 0x00
Latency: 1
C1E:
Flags/Description: MWAIT 0x01
Latency: 4
C6:
Flags/Description: MWAIT 0x20
Latency: 128
```
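
The difference between the two listings can be detected automatically across a cluster. A minimal sketch that parses `cpupower idle-info` text and flags ACPI-defined states; the sample strings are abbreviated from the listings above and the function names are illustrative:

```python
def parse_idle_states(cpupower_output):
    """Return the names on the 'Available idle states:' line."""
    for line in cpupower_output.splitlines():
        if line.startswith("Available idle states:"):
            return line.split(":", 1)[1].split()
    return []


def uses_acpi_cstates(cpupower_output):
    """True when any idle state carries the _ACPI suffix."""
    return any(name.endswith("_ACPI")
               for name in parse_idle_states(cpupower_output))


base = """CPUidle driver: intel_idle
Number of idle states: 3
Available idle states: POLL C1_ACPI C2_ACPI
"""

patched = """CPUidle driver: intel_idle
Number of idle states: 4
Available idle states: POLL C1 C1E C6
"""

print("base kernel uses ACPI C-states:", uses_acpi_cstates(base))
print("patched kernel uses ACPI C-states:", uses_acpi_cstates(patched))
```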

# C-State Influence on Turbo Frequency Behavior

#### Processor cannot reach maximum turbo frequency without C-states

#### C-States Disabled

```
2.8 GHz, 32 threads
2.8 GHz, 30 threads
2.8 GHz, 28 threads
2.8 GHz, 26 threads
2.8 GHz, 24 threads
2.8 GHz, 22 threads
2.8 GHz, 20 threads
2.8 GHz, 18 threads
2.8 GHz, 16 threads
2.8 GHz, 14 threads
2.8 GHz, 12 threads
2.8 GHz, 10 threads
2.8 GHz, 8 threads
2.8 GHz, 6 threads
2.8 GHz, 4 threads
2.8 GHz, 2 threads
2.8 GHz, 1 thread
```

#### C-States Enabled

```
2.8 GHz, 32 threads
2.8 GHz, 30 threads
2.8 GHz, 28 threads
3.0 GHz, 26 threads
3.1 GHz, 24 threads
3.1 GHz, 22 threads
3.2 GHz, 20 threads
3.3 GHz, 18 threads
3.4 GHz, 16 threads
3.4 GHz, 14 threads
3.4 GHz, 12 threads
3.4 GHz, 10 threads
3.4 GHz, 8 threads
3.4 GHz, 6 threads
3.4 GHz, 4 threads
3.4 GHz, 2 threads
3.4 GHz, 1 thread
```

Active cores frequency behavior for Intel Xeon Platinum 8352Y
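
The two listings can be compared numerically. A minimal sketch using a few of the frequencies listed above; the dictionary and function names are illustrative:

```python
DISABLED_GHZ = 2.8  # every thread count in the C-States Disabled listing

# Selected rows from the C-States Enabled listing: active threads -> GHz
ENABLED_GHZ = {32: 2.8, 26: 3.0, 20: 3.2, 16: 3.4, 1: 3.4}


def turbo_uplift(threads):
    """Fractional frequency gain with C-states enabled versus disabled."""
    return ENABLED_GHZ[threads] / DISABLED_GHZ - 1.0


for threads in sorted(ENABLED_GHZ, reverse=True):
    print(f"{threads:>2} threads: {turbo_uplift(threads):+.1%}")
```

With 16 or fewer active threads the gain is 3.4 / 2.8, roughly 21%.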

# NVIDIA GPUs





# R750xa



| CPU          | 2U2S, Ice Lake (PCIe Gen4)                                                                          |
|--------------|-----------------------------------------------------------------------------------------------------|
| Memory       | 32x DDR4<br>3200 MT/s                                                                               |
| GPU/<br>FPGA | Offering the latest GPUs by NVIDIA: A100,<br>A40 with NVLINK Bridges, M10 and T4;<br>and AMD: MI100 |
| Storage      | Up to 8 SAS/SATA SSD or NVMe drives<br>Optional BOSS                                                |
| I/O          | Up to 8 x PCIe Gen4 Slots (6 x16, 2 x8)                                                             |
| Network      | 2x1GbE LOM; 1x8 Gen4 OCP3.0                                                                         |
| Cooling      | High Performance Fans Optional Liquid cooling support for CPUs                                      |
| PSU          | 1+1 1400W, 2400W (Platinum)                                                                         |

## **R750xa Value Proposition**

PowerEdge go-to platform for GPU-optimized workloads

#### The R750xa offers:



#### **GPU Optimization**

- Massive compute power for the most complex accelerator workloads
- Intel Ice Lake CPU and PCIe 4.0 to unleash the full capabilities of GPU-based compute



#### **Workload Flexibility**

- Full-featured support for the complete stack of GPUs in the PowerEdge portfolio
- Max performance for the entire spectrum of HPC, AI-ML/DL training and inferencing, DB Analytics and VDI workloads



#### **Scalable Density**

- Air cooled 2U with ambient temperature of up to 35°C
- GPU density of 2GPU/U with additional support for the newly introduced NVLINK Bridges
- Optional liquid cooling for CPUs to capture up to 20% of heat dissipation


### **HPL**





- ~2x performance with A100.
- Higher double-precision peak performance with A100
- Large problem size

# Questions?


#### HPCG Results - C4140 vs R750xa



- Memory bandwidth dependent
- 900 GB/s vs 1555 GB/s of GPU memory bandwidth
- ~1.7x improvement at 4-GPUs
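
Because HPCG is bound by memory bandwidth, the expected gain can be estimated from the ratio of the two bandwidth figures above. A minimal sketch, assuming both figures are per-GPU memory bandwidth in GB/s; the function name is illustrative:

```python
def expected_speedup(old_bandwidth_gbs, new_bandwidth_gbs):
    """Upper-bound speedup for a purely bandwidth-bound kernel."""
    return new_bandwidth_gbs / old_bandwidth_gbs


print(f"{expected_speedup(900, 1555):.2f}x")
```

The ratio is about 1.73, which is consistent with the ~1.7x improvement reported at 4 GPUs.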

#### **Lennard Jones**



- LAMMPS is a double-precision application.
- Up to 2x performance improvement with A100 GPUs


# Whitley/Ice Lake 2S Overview

| CPU                                         | Ice Lake (up to 270W¹)<br>52b/57b Physical Address/Virtual Address |
|---------------------------------------------|---------------------------------------------------------------------|
| New Capabilities                            | Intel® DL Boost: VNNI for inference. No support for BFLOAT16<br>Crypto Enhancements: 2xAES, SHA Extensions, VPMADD52<br>Database compression: VBMI<br>Security: Intel® TME/TME-MT, Intel® SGX, PFR |
| Socket                                      | Socket P+ 4189 pin |
| Scalability                                 | 1S, 2S |
| Memory                                      | 8 channels DDR4 per CPU @3200³ 2DPC,<br>16 DIMMs per socket<br>New Intel® Optane™ Persistent Memory 200 Series capable⁴ |
| Intel® Ultra Path Interconnect (Intel® UPI) | Up to 3 links per CPU<br>x20, speed: 10.4 and 11.2 GT/s |
| PCIe                                        | PCIe 4.0: Up to 64 lanes per CPU² (bifurcation support: x16, x8, x4), up to 48 lanes on North, 16 lanes on South of socket, NTB |
| PCH: Intel® C620A Series Chipset (LBG-R)    | IE, Intel QAT, eSPI, No Integrated 4x10GbE/1GbE ports, Legacy 1GbE for manageability support, Up to 14 SATA 3, Up to 14 USB 2.0, Up to 10 USB 3.0, Up to 20 ports PCIe 3.0 (8 GT/s), Enhanced security through hardware via new stepping |



## Sunny Cove Core Microarchitecture



|                                  | Cascade Lake<br>(per core) | Ice Lake<br>(per core)     |
|----------------------------------|----------------------------|----------------------------|
| Out-of-order Window              | 224                        | 384                        |
| In-flight Loads + Stores         | 72 + 56                    | 128 + 72                   |
| Scheduler Entries                | 97                         | 160                        |
| Register Files –<br>Integer + FP | 180 + 168                  | 280 + 224                  |
| Allocation Queue                 | 64/thread                  | 70/thread;<br>140/1 thread |
| L1D Cache (KB)                   | 32                         | 48                         |
| L1D BW (B/Cyc) –<br>Load + Store | 128 + 64                   | 128 + 64                   |
| L2 Unified TLB                   | 1.5K                       | 2K                         |
| Mid-level Cache (MB)             | 1                          | 1.25                       |

- Improved Front-end: higher capacity and improved branch predictor
- Wider and deeper machine: wider allocation and execution resources + larger structures
- Enhancements in TLBs, single thread execution, prefetching
- Server enhancements: larger Mid-level Cache (L2) + second FMA

# Ice Lake & Cooper Lake Product Numbering Convention for Intel® Xeon® Scalable Processors



Note: All information provided here is subject to change without notice. Intel may make changes to specifications and product descriptions at any time, without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. For latest information please refer to the Snapshot


#### Preliminary Guidance – Example SKUs Included as Reference

